The Forest or the Trees? Tackling Simpson's Paradox with Classification and Regression Trees
Similar resources
The Forest or the Trees? Tackling Simpson's Paradox in Big Data Using Trees
Prediction and variable selection are major uses of data mining algorithms but they are rarely the focus in causal IS research. Because experiments are often impossible, unethical or expensive to perform, causal IS research often relies on observational data. A major challenge is to infer causality from such data. Simpson’s paradox can arise in such contexts, causing uncertainty regarding the r...
Tackling Simpson's Paradox in Big Data using Classification & Regression Trees
This work aims to find potential Simpson’s paradoxes in Big Data. Simpson’s paradox (SP) arises when choosing the level of data aggregation for causal inference. It describes the phenomenon in which the direction of a cause's effect reverses between the aggregate and the disaggregated levels of a sample or population. The practical decision-making dilemma that SP raises is which level of...
CORT: classification or regression trees
In this paper we challenge three of the underlying principles of CART, a well-known approach to the construction of classification and regression trees. Our primary concern is with the penalization strategy employed to prune back an initial, overgrown tree. We reason, based on both intuitive and theoretical arguments, that the pruning rule for classification should be different from that used fo...
Model-Based Classification Trees
The construction of classification trees is nearly always top-down, locally optimal and data-driven. Such recursive designs are often globally inefficient, for instance in terms of the mean depth necessary to reach a given classification rate. We consider statistical models for which exact global optimization is feasible, and thereby demonstrate that recursive and global procedures may result in ve...
Outlier Detection by Boosting Regression Trees
A procedure for detecting outliers in regression problems is proposed. It is based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate after removing it. The selection criterion is based on Tchebychev’s inequality applied to the maximum over the boosting iterations of ...
Journal
Journal title: SSRN Electronic Journal
Year: 2014
ISSN: 1556-5068
DOI: 10.2139/ssrn.2392953